ABSTRACT 


The creation of crowd-sourced content in learning systems is a 
powerful method for adapting learning systems to the needs of a 
range of teachers in a range of domains, but the quality of this 
content can vary. This study explores linguistic differences in 
teacher-created problem content in ASSISTments using a 
combination of discovery with models and correlation mining. 
Specifically, we find correlations between semantic features of 
mathematics problems and indicators of learning and engagement, 
suggesting promising areas for future work on problem design. 
We also discuss limitations of semantic tagging tools within 
mathematics domains and ways of addressing these limitations. 
1. INTRODUCTION 


As content is developed at scale for online learning systems, 
particularly systems that leverage content developed by large 
numbers of authors, it becomes important to distinguish between 
problems which are well-written and conducive to learning and 
those which are poorly worded or otherwise difficult to 
understand. Crowd-sourced content, where content is authored by 
a broader community [21], is a powerful and scalable method of 
content creation, which can be used to quickly develop and deploy 
new content and curricula ([46], [17]). 


For this reason, it is critical that an equally scalable method of 
analyzing problem quality be developed, to prevent learning 
platforms that leverage crowd-sourced content from becoming 
dominated by ineffective content. In other platforms such as 
Wikipedia the quality of crowd-sourced materials is improved 
through substantial coordination between contributors [20]. 
However, there is relatively little work evaluating crowd-sourced 
learning content at scale. In contrast with more traditional 
educational measurement (from tests), where determining items’ 
ability to discriminate student knowledge is a standard part of 
item analysis [11], there has been less attention to this problem for 
online learning systems. While some researchers have attempted 
to determine which hints are more effective [18], or which 
problems are associated with more learning [14], these efforts 
have focused on which system features affect students, but not 
why, limiting the generality of their findings. A more 
theoretical approach was taken by [49] where a design space of 
over 70 features characterizing Cognitive Tutor lessons was 
distilled and correlated with an automated gaming the system 
detector. However, this work identified the characteristics of tutor 
lessons using hand-coding, a method that is infeasible for larger 
datasets, and was limited to the relatively narrow space of 
problems designed by professional educational developers. 


An alternative method for the analysis of the design of content in 
large-scale educational systems is text mining. There is a 
considerable amount of small-scale research on linguistic features 
that impact reading in mathematical contexts [47], but as [16] 
point out, many of the traditional readability indices used to study 
language at scale are limited in the features they consider. As a 
result, many early studies did not find a relationship between 
readability and performance in mathematics word problems [48]. 


As more advanced linguistic tools have become available, 
large-scale investigations of mathematics language have become more 
fruitful. For example, [44] have used LIWC [37] and Coh-Metrix 
[15] to study the effects of linguistic properties of mathematics 
problems ([44], [45]). [45] found that third-person singular 
pronouns (e.g., he, she) are significantly associated with correct 
answers and fewer hint requests in Cognitive Tutor problems. 
They found positive correlations between the use of work-related 
terms and learning, and negative correlations between the use of 
terms related to social constructs and learning. These findings 
highlight the potential value of linguistic features for better 
understanding learning, as well as the need to explore a wider 
range of semantic categories in a broader range of mathematics 
content areas. 


In this paper, we use a discovery with models approach, 
generating prediction labels from automated detectors of student 
learning and engagement that were developed for the 
ASSISTments online learning system ([2], [32]). We build on 
[46]’s approach of using text mining software and text elements, 
such as HTML tags and Unicode characters, to distill features 
from a corpus of mathematics problems. We then use correlation 
mining approaches to identify links between these features and 
our labels of student engagement and learning as a means for 
determining which combinations of linguistic features are 
associated with particularly effective problems. 


1.1 ASSISTments 

The current study uses data collected from the ASSISTments 
system. ASSISTments is an online intelligent tutoring system 
used by over 50,000 students annually for middle-school 
mathematics. It provides both formative and summative 
assessment as well as extensive student support (assistment) and 
detailed teacher reports. It also facilitates research using 
randomized controlled trials (RCTs) that allow researchers to 
conduct studies without interfering with instructional time [17]. 


Within the system, students are assigned problem sets that may 
vary on several dimensions. Problem sets can be differentiated in 
terms of how problems are assigned: (a) In Complete All problem 
sets, problem order may be randomized; students must correctly 
answer all of the questions assigned and cannot advance to the 
next problem unless they have answered correctly. (b) In If-Then-Else 
problem sets, students must correctly answer a specified 
percentage of questions (default is 50%) in order to pass, 
or else they may be given additional problems. (c) Finally, in Skill 
Builder problem sets, students must get 3 consecutive correct 
answers in order to pass, thus allowing students who show 
mastery to move on quickly to new assignments while providing 
struggling students with extended practice. 
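
The passing rules for two of these problem set types can be sketched in a few lines. This is only an illustration of the rules as described above, not the ASSISTments implementation: the function names are hypothetical, and treating the 50% default as an inclusive threshold is an assumption.

```python
def skill_builder_passed(responses, streak_needed=3):
    """Return True once `streak_needed` consecutive correct answers occur.

    `responses` is a sequence of booleans in the order the student answered.
    """
    streak = 0
    for correct in responses:
        # A wrong answer resets the run of consecutive correct answers.
        streak = streak + 1 if correct else 0
        if streak >= streak_needed:
            return True
    return False


def if_then_else_passed(responses, threshold=0.5):
    """Return True if the share of correct answers meets the threshold.

    Assumption: the threshold is inclusive (>=).
    """
    if not responses:
        return False
    return sum(responses) / len(responses) >= threshold
```

A student who answers correct, wrong, correct, correct, correct passes a Skill Builder on the fifth response, whereas a student who never strings three correct answers together keeps receiving practice.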


The purpose of the current study is to evaluate the semantic 
properties and HTML metadata (which may carry semantic 
meaning) of problems authored in ASSISTments. Many have 
been vetted by the ASSISTments expert team, but others (76% as 
of 2014) were created by teachers themselves [17]. ASSISTments 
provides scripted templates, which allow teachers to customize 
problem sets for specific topics. Therefore, finding ways to 
identify meaningful differences in teachers’ problem design is an 
important area of research. 


2. DATA & METHODS 


In this paper, we analyze 179,908 problems within the 
ASSISTments system, most developed by teachers. We study 
these problems using the features of the problems themselves, in 
combination with data from the log files of 22,225 students who 
used ASSISTments during the 2012-13 school year. We applied 
models from previous research on engagement and learning to 
these students’ log files in order to determine how these constructs 
are associated with features of the design of the problems, 
developed through linguistic analysis and other data about the 
problems. In doing this, we excluded from consideration features 
that had been previously used within the learning and engagement 
models described below, to prevent overfitting. 


2.1 Learning & Engagement Measures 
Learning and engagement were assessed automatically, using 
detectors or models of these constructs. 


2.1.1 Student Learning 

Student learning was assessed by fitting the moment-by-moment 
learning model to the data [2]. The moment-by-moment learning 
model (MBMLM) attempts to infer the specific effect of each 
learning opportunity on a student’s overall mastery. We used [2]’s 
look-ahead-two probabilistic approach, which assumes that 
learning can occur at multiple points along a student’s trajectory 
of learning a skill, rather than [43]’s approach, which assumes a 
single moment of learning. We also chose this formulation 
because it explicitly analyzes future performance, allowing us to 
focus on cases where students perform better than expected after 
encountering a particular problem. Using the MBMLM allows us 
to isolate the average learning associated with specific problems 
within the data and compare these averages to other problems that 
either lack or have particular features of interest. 


2.1.2 Automated Detectors of Engagement 

Detectors of student engagement were developed using data from 
in situ classroom observations, conducted by experts certified in 
the Baker Rodrigo Ocumpaugh Monitoring Protocol (BROMP 
2.0). Observers used HART, an Android application 
designed specifically for BROMP and freely available for 
non-commercial research [33], which enforces the protocol while 
facilitating data collection. 


Upon completion of the observations, data mining techniques 
were then employed to provide models of each construct that were 
cross-validated at the student level. In this paper, affective models 
developed for three different populations of students were applied, 
matching urban, suburban, and rural models to student data based 
on the location of their schools, in order to ensure population 
validity [32]. A detailed description of the features and algorithms 
used in these detectors is given in [32] and [34]. 


2.1.3 Applying Across-Student Measures of Learning & Engagement to Individual Problems 

In this paper, both the MBML model and the engagement models 
were used as indicators of problem effectiveness. This section 
describes how these models were aggregated across the 179,908 
problems and 22,225 students in this study. The formulation of the 
MBMLM in [2] is calculated once for each problem, at the time of 
the first attempt, and there is only one estimate per problem. 
Therefore, MBML was estimated for each student based on the 
sequence in which the problem was seen. Problem-level measures 
were then produced by averaging the MBML values across all 
students who saw a given problem. 


The affective models were applied by segmenting the data at 20- 
second intervals (matching the original approach used to develop 
the detectors), and then applying each model to each segment. 
Confidence values for each detector were averaged twice at the 
problem level: first for each student (in order to avoid biasing the 
estimates in favor of the affect experienced by students who spent 
longer working the problem), then across all students who had 
seen that problem. This resulted in five measures per problem 
(average boredom, confusion, engaged concentration, frustration, 
and gaming), which we used, along with MBMLM outcomes, as 
our dependent variables. 
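
The two-stage averaging described above can be illustrated with a short sketch. The record layout (problem, student, confidence) and the function name are assumptions made for this example; the actual log format is not described here.

```python
from collections import defaultdict
from statistics import mean


def problem_level_scores(segments):
    """Average detector confidence per problem in two stages.

    `segments` is an iterable of (problem_id, student_id, confidence)
    tuples, one per 20-second segment (hypothetical layout).
    """
    # Stage 1: collect the segments of each student on each problem.
    per_pair = defaultdict(list)
    for problem_id, student_id, confidence in segments:
        per_pair[(problem_id, student_id)].append(confidence)

    # Stage 2: average within each student, then across students,
    # so every student counts once regardless of time on the problem.
    per_problem = defaultdict(list)
    for (problem_id, _student_id), values in per_pair.items():
        per_problem[problem_id].append(mean(values))
    return {pid: mean(vals) for pid, vals in per_problem.items()}


# One student contributes a single segment, another contributes three.
segments = [
    ("p1", "s1", 0.2),
    ("p1", "s2", 0.8),
    ("p1", "s2", 0.8),
    ("p1", "s2", 0.8),
]
scores = problem_level_scores(segments)
```

In this toy example the two-stage estimate for p1 is 0.5, whereas pooling all four segments would give 0.65, weighting the slower student three times as heavily.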


2.2 Feature Engineering 

A number of different design features may influence student 
learning and engagement. In this paper, we explore features of 
both the problem text and its meta-text. Specifically, we look at 
word counts, lexical category features generated by a semantic 
tagger, and features generated from the metadata connected to the 
problem, which provides us with a separate source of semantic 
data (e.g., the use of mathematical notation which would not be 
captured by a semantic tagger) as well as with information about 
its use of tables, images, formatting, bolded or emphasized text. 


2.2.1 Wmatrix Semantic Tags 
The semantic content of ASSISTments problems was analyzed 
with Wmatrix [39], a corpus analysis and comparison tool that 
parses text at a word and multi-word level. As of 2004, this 
included 42,300 single word entries and over 18,400 multi-word 
expressions [38]. Wmatrix has been used in a number of analyses, 
including work to tag and identify lexical patterns in ontology 
learning [13] and work to study how students self-explain when 
learning science content [12]. Its semantic tagger uses a semi- 
hierarchical structure where all known words and multi-word 
units are classified into one of 21 lexical fields, represented with 
letters by its tagging system. These lexical fields may (or may not) 
be further subdivided in up to three different levels, which are 
represented in what we will refer to as the base tag. 


Figure 1. WMatrix tagging system. 
Within the lexical tag, we will refer to the lexical field 
(alphabetical) and the 1st, 2nd, and 3rd order subfields (numeric) as 
the base tag. Additional information about antonyms (black vs. 
white), comparatives (better, worse, more confusing, etc.), 
superlatives (best, worst, most confusing, etc.), gender (masculine, 
feminine, and neuter), and anaphoric status (i.e., contextual 
reference) may or may not be appended to a base tag. Wmatrix 
documents 234 distinct base tags, and represents a large number 
of additional possible labels through appendices. 
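
As a rough illustration of this structure, a tag can be split into its lexical field, numeric subfields, and appended material. The pattern below is an assumption inferred from the tags that appear in this paper (e.g., A11.1+++, N3.3---, Z5mwu); it is not taken from the Wmatrix documentation, and the function name is hypothetical.

```python
import re

# One capital letter (lexical field), optional dotted numbers (subfields),
# then whatever is appended (e.g. "+", "---", "mwu", "mf").
TAG_PATTERN = re.compile(r"^([A-Z])(\d+(?:\.\d+)*)?(.*)$")


def split_tag(tag):
    """Split a semantic tag into field, subfields, base tag, and appendix."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a recognizable tag: {tag!r}")
    field, numbers, appendix = match.groups()
    subfields = [int(n) for n in numbers.split(".")] if numbers else []
    return {
        "field": field,
        "subfields": subfields,
        "base": field + (numbers or ""),
        "appendix": appendix,
    }


parts = split_tag("A11.1+++")
```

Here `parts["base"]` is "A11.1" and `parts["appendix"]` is "+++", so features could be grouped either by full tag or by base tag.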


In the ASSISTments data, 442 distinct Wmatrix tags (base + 
appendices) were identified. These tags were most likely to fall 
under 7 lexical fields: General & Abstract Terms (A), Numbers & 
Measurement (N), Social Actions, States, & Processes (S), 
Psychological Actions, States, & Processes (X), Names & 
Grammatical Words (Z), Money & Commerce in Industry (I), and 
Time (T). 


2.2.2 Accommodating Known Wmatrix Limitations 
Although Wmatrix has been evaluated for its effectiveness in a 
range of genres, domains, and historical periods [38], semantic 
taggers can have a number of limitations when applied to highly 
specialized domains ([28], [24], [36], [30], [27]). For example, 
research has shown that words which contain more than one unit 
of meaning create challenges for taggers that apply only one label 
per word [41]. As a result, semantic taggers which work 
specifically with scientific language have become an area of 
research interest ([1], [10]), but the language of mathematics has 
not yet been as prominent. 


As such, features generated by Wmatrix must be carefully 
checked within this data set and may need to be supplemented by 
domain-specific tags. For example, we found that Wmatrix 
erroneously tagged several high-frequency items that appeared in 
ASSISTments’ instructions to students, including problems that 
instructed students to enter fractions in a specific format in order 
to receive credit or which told students that they had 3 attempts 
left. Wmatrix treated many of these words (e.g., enter and left) as 
an indication of physical movement (M1, as in entering a building 
or turning left). A few erroneous tags also appeared to result from 
the development of Wmatrix as a tool for British English. For 
instance, ASSISTments users, who are primarily American 
English speakers, wrote a number of problems involving a person 
named Randy, whose name was automatically (and erroneously) 
tagged as involving sexual content. 


To mitigate this issue, significant correlations were carefully 
inspected individually. This approach has been found to be useful 
in previous studies where semantic taggers were applied to new 
domains [12]. While the large size of the ASSISTments corpus 
limits our ability to address this problem completely, thorough 
efforts were made to examine and understand relationships 
discovered through the use of Wmatrix. In instances where 
Wmatrix applied a tag involving the wrong sense of a word for the 
context in which it was used, we have specifically noted this 
difference and what sense of a word or words the tag is capturing 
within ASSISTments. 


2.2.3 Math Symbols and Other Textual Metadata 

In addition to generating features with Wmatrix, we also 
generated features based on the metadata of each problem. We 
were primarily concerned with identifying Unicode characters that 
are semantically meaningful in mathematics contexts. In the 
ASSISTments corpus, we labeled 68 symbols, such as those for 
integrals, mean, standard deviation, and exponents. These 
domain-specific symbols present unique challenges to the 
teaching and learning of mathematics [40], but are not detected by 
most lexical analysis tools, which have not generally been 
developed for mathematics domains. In addition, we identified 14 
HTML tags that were used to format ASSISTments problems, 
including tags used for boldface, italics, paragraph structure, and 
images. Because many of these functions can also alter the 
semantics of a problem, we also generated features that reflect 
these uses of HTML in problem metadata. These features were 
generated by counting the number of times that each HTML code 
was used in a problem, in parallel to the application of the 
Wmatrix tags discussed in previous sections. 
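
A minimal sketch of this kind of counting is shown below, using Python's standard-library HTML parser. The symbol set is a small hypothetical subset chosen for illustration (the 68 labeled symbols are not listed here), and the class and function names are likewise assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical subset of mathematically meaningful Unicode characters.
MATH_SYMBOLS = {"∫", "σ", "√", "≤", "≥", "π"}


class TagCounter(HTMLParser):
    """Count opening HTML tags and collect the visible text."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1

    def handle_data(self, data):
        self.text_parts.append(data)


def metadata_features(problem_html):
    """Return (HTML tag counts, math symbol counts) for one problem."""
    parser = TagCounter()
    parser.feed(problem_html)
    parser.close()  # flush any buffered text
    text = "".join(parser.text_parts)
    symbol_counts = Counter(ch for ch in text if ch in MATH_SYMBOLS)
    return parser.tag_counts, symbol_counts


tags, symbols = metadata_features(
    "<p>Find <b>x</b> if √x ≤ 3.</p><p><em>Show</em> your work.</p>"
)
```

For this example problem, the counts are two `p` tags, one `b` tag, one `em` tag, and one occurrence each of √ and ≤.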


3. RESULTS 


To explore the relationship between these problem features and 
the BROMP-trained measures of engagement and learning, we 
correlated each problem feature to each predicted variable. We 
selected Spearman’s ρ as our correlation coefficient because of its 
increased robustness when correlating non-normal data as 
compared to parametric coefficients such as Pearson’s r 
[50]. Additionally, with such a high number of comparisons being 
conducted it was necessary to adjust our significance criterion to 
account for the possibility of tests being incorrectly identified as 
significant. The Benjamini and Hochberg post-hoc procedure [4] 
was used to control for these false discoveries. A table of results 
by dependent variable is presented in Table 1, which also provides 
the average confidence level for each detector as a baseline 
measure for this data. 
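
For readers unfamiliar with these two steps, the sketch below shows a rank correlation with averaged ranks for ties, followed by the Benjamini-Hochberg step-up rule. It is a plain-Python illustration rather than the analysis code used in this study; the false discovery rate of 0.05 is an assumed default, and the p-values are assumed to come from a separate significance test for each correlation.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which tests are significant.

    Step-up rule: find the largest rank k with p_(k) <= (k / m) * q and
    accept every test whose p-value is at or below that one.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    significant = [False] * m
    for idx in order[:cutoff]:
        significant[idx] = True
    return significant
```

Because the rule is step-up, a p-value that misses its own threshold can still be accepted when a larger p-value further down the sorted list meets its threshold.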


Table 1. N of significant features by outcome measure. 


Outcome Measure          Avg Conf.   Total Sig   Sig w/ |ρ| > 0.05   Sig w/ |ρ| > 0.10
Bored                    0.16        118         16                  0
Engaged Concentration    0.46        251         62                  14
Confusion                0.03        285         60                  5
Frustration              0.04        216         36                  7
Gaming the System        0.02        257         43                  5


Of the possible 2730 correlations, 1127 (41.3%) were statistically 
significant after controlling for multiple comparisons using 
Benjamini & Hochberg’s post-hoc control. More features were 
significantly correlated with confusion than any other outcome 
measure, but large numbers of features were also correlated with 
gaming the system, engaged concentration, frustration and 
MBML. Boredom was correlated with fewer features, overall, 
than any of the other outcome measures. These broad findings 
suggest the potential for finding semantic features that may help 
to provide templates for improving the design of word problems. 


3.1 Features associated with all outcome measures 
In the following sections, we examine the relationships between 
our features and the individual outcome measures, but in order to 
provide a broad summary of which types of features had the 
largest effects, the absolute value of Spearman’s ρ was averaged 
across all six outcome measures for each feature in this study. 
Among the 64 features that were significantly correlated with all 
six outcomes, the 10 with the highest average |ρ| (shown in Table 
2) were drawn from 5 lexical fields: Grammatical Bin (Z), 
General Terms (A), Time (T), Speech Acts (Q), and Numbers & 
Measurement (N). One HTML tag (<p>, paragraph) was also 
significant. 


Table 2. 10 largest correlated features by average sig. |ρ| 


Tag       Avg |ρ|   MBML     Boredom   Eng. Conc.   Confusion   Frustration   Gaming
Z5        0.116      0.193    0.086    -0.165        0.084       0.105         0.060
Z5mwu     0.104      0.114    0.034    -0.040        0.135       0.162         0.140
A12-      0.101      0.114   -0.027     0.030        0.086       0.153         0.198
T3-       0.091      0.084   -0.034     0.055        0.074       0.144         0.153
Q2.2      0.080      0.043    0.083    -0.162        0.068       0.071         0.051
T1.1.2    0.076      0.076   -0.051     0.031        0.067       0.116         0.116
<p>       0.071      0.149    0.054    -0.127        0.015       0.064        -0.015
N1        0.069      0.061    0.076    -0.077        0.082       0.080         0.035
A5.4+     0.066     -0.028    0.059    -0.130        0.074       0.038        -0.069
76        0.056      0.108    0.020    -0.034        0.077      -0.032         0.071


Spearman’s ρ is also shown for individual outcome measures, 
allowing us to examine the effects of these features in greater 
detail. Table 2 shows that Wmatrix’s Speech Acts tag (Q2.2, e.g., 
answer, account, or speak out) is correlated with small increases in 
learning, but is also positively correlated with increased boredom 
and gaming and decreased concentration. The Wmatrix features 
described as Grammatical Bin (words such as as, but, in order to) 
are also correlated with increased learning, boredom, and gaming. 
Correspondingly, they are also negatively associated with engaged 
concentration, illustrating the complicated interactions at play in 
this data and the importance of considering multiple outcomes 
when exploring design effects. 
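
The averaging used to rank features in this section can be checked by hand. The short sketch below takes the six Spearman values reported for the Speech Acts tag (Q2.2) in Table 2 and averages their absolute values; the function name is hypothetical.

```python
def mean_abs_rho(rhos):
    """Average the absolute values of a feature's correlations."""
    return sum(abs(r) for r in rhos) / len(rhos)


# Six outcome-level correlations for Q2.2, as listed in Table 2.
q2_2_rhos = [0.043, 0.083, -0.162, 0.068, 0.071, 0.051]
q2_2_average = round(mean_abs_rho(q2_2_rhos), 3)
```

The result, 0.08, matches the average listed for Q2.2. Taking absolute values first keeps a feature such as Q2.2, which has both positive and negative correlations, from averaging toward zero.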


4. Results by Outcome Measure 

While some interactions are complicated, we also see many 
features correlate in logical patterns. For example, features that 
are positively associated with boredom are often also negatively 
associated with engaged concentration, and vice-versa. Likewise, 
features associated with confusion are also associated with 
frustration. The remainder of this section discusses these patterns 
in greater detail, pairing outcome measures that are conceptually 
related (e.g., boredom and engaged concentration as well as 
MBML and gaming the system, which have been shown to be inversely 
related in the past). Specifically, we will examine the ten features 
that are most negatively associated and the ten that are most 
positively associated with each outcome measure, discussing 
commonalities across outcome measures. 
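
The selection of the most negatively and most positively associated features, and the comparison of these lists across outcome measures, can be sketched as follows. The function names are hypothetical, and the small dictionaries reuse a handful of confusion and frustration values from Table 4 purely for illustration (with k = 2 rather than 10).

```python
def extreme_features(rho_by_feature, k=10):
    """Return (k most negative, k most positive) features by rho.

    If fewer than 2 * k features are supplied, the two lists overlap.
    """
    ranked = sorted(rho_by_feature, key=rho_by_feature.get)
    return ranked[:k], ranked[-k:]


def shared_features(first, second):
    """Features that appear in both lists, in alphabetical order."""
    return sorted(set(first) & set(second))


confusion = {"X2.1": -0.149, "N3.4": -0.097, "S6+": 0.105, "X8+": 0.115, "Z5mwu": 0.135}
frustration = {"X2.1": -0.110, "N3.4": -0.061, "T3-": 0.144, "X8+": 0.148, "Z5mwu": 0.162}

conf_low, conf_high = extreme_features(confusion, k=2)
frus_low, frus_high = extreme_features(frustration, k=2)
both_high = shared_features(conf_high, frus_high)
both_low = shared_features(conf_low, frus_low)
```

In this reduced example, X8+ and Z5mwu are shared among the most positive features and X2.1 and N3.4 among the most negative, mirroring the kind of overlap discussed in the sections that follow.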


4.1.1 Learning & Gaming the System 

The Spearman ρ values for the top ten features range from -0.078 
to 0.233 for MBML and from -0.095 to 0.198 for gaming the 
system. Table 3 presents these results, highlighting features that 
correlate with both outcome measures. 


Table 3. Features most strongly associated with MBML and 
gaming the system 


LEARNING                                                    | GAMING
TAG        Semantic Description                     ρ       | TAG        Semantic Description                       ρ
A5.2+      True/False                               -0.078  | N5+        Quantities                                 -0.095
S9         Religion & the supernatural              -0.075  | A10+       Open/Closed; Hiding/Hidden; Finding        -0.092
A11.1+++   Important/Significant                    -0.066  | X2.1       Thought/belief                             -0.084
A6.1+      Similar/Different                        -0.062  | A2.1+      Modify, Change                             -0.082
G2.2+      General Ethics                           -0.059  | S5+        Groups and affiliation                     -0.074
N3.2+++    Measurement: Size                        -0.059  | N5.2+      Exceeding; waste                           -0.070
A3-        Being                                    -0.058  | A5.4+      Authenticity                               -0.069
Z8mwu      Pronouns etc.                            -0.054  | T1         Time: General                              -0.069
N1mwu      Numbers                                  -0.051  | N5         Quantities                                 -0.067
X5.2+      Interest/boredom/excited/energetic       -0.049  | T2+        Time: Beginning and ending                 -0.067
A12-       Easy/Difficult                            0.114  | A7+mwu     Definite (+modals)                          0.086
Z5mwu      Grammatical bin                           0.114  | X2.4mwu    Investigate/examine/test/search             0.087
Z99        Unmatched                                 0.114  | N3.8+      Measurement: Speed                          0.093
N3.3---    Measurement: Distance                     0.115  | Z8         Pronouns etc.                               0.093
X2.2+      Knowledge                                 0.121  | A12+++     Easy/Difficult                              0.098
M7         Places (geographical & conceptual)        0.130  | T1.1.2     Time: General: Present; Simultaneous        0.116
N3.8+      Measurement: Speed                        0.142  | X8+        Trying                                      0.140
<p>        HTML paragraph                            0.149  | Z5mwu      Grammatical bin                             0.140
Z5         Grammatical bin                           0.193  | T3-        Time: Old, new and young; age               0.153
M1         Moving, coming, & going                   0.223  | A12-       Easy/Difficult                              0.198


Although gaming is an infrequent behavior, previous research has 
shown that it is linked to poorer learning ([7], [34]). Therefore the 
findings in Table 3 are somewhat surprising. We should expect 
gaming’s infrequency to limit overlap between the two categories, 
and expect them to show inverse relationships when present. 
Instead, A12- (words related to difficulty), Z5mwu (multiword 
grammatical units like as far as or for example), and N3.8+ 
(words related to higher speeds), are all associated with increased 
MBML and increased gaming behaviors. Likewise, semantically 
similar categories like N1mwu (multiword numbers) and N5+ 
(large quantities) are associated with lowered MBML and 
lowered rates of gaming behaviors. 


These anomalies might be due to the existence of problems that 
support learning but can be gamed relatively easily, or might 
suggest that particularly challenging problems lead to learning but 
also inspire gaming behavior. For example, A5.2+ (words 
associated with true) demonstrates the lowest correlation with 
learning, a result that is consistent with literature on the 
ineffectiveness of true/false questions [42]. Likewise Z8mwu 
(multiword pronouns, e.g., anything at all) is correlated with 
lower MBML, while Z8 (single word pronouns, e.g., it, my, and 
you) is correlated with increased gaming. These findings align 
with research showing that pronouns can be difficult to process 
cognitively (taxing working memory), as they require readers to 
infer their antecedents (the words that give them their meaning) 
from context ([25], [8], [22], [6]). This suggests that pronouns 
could inhibit learning by drawing mental resources away from 
the mathematics task, perhaps inspiring some students to try to 
succeed with minimal cognitive effort. 


These findings highlight important considerations for researchers 
working to improve learning systems, including the need to 
consider multiple measures. For example, [44] found that 
pronouns are associated with correct answers and lowered hint 
use. It is highly likely that pronouns can have beneficial impacts 
on learning, particularly through [44]’s hypothesized mechanism 
of increased cohesiveness. However, if pronoun use in 
ASSISTments and Cognitive Tutor is comparable, our results 
suggest that some correct answers could have been achieved by 
guessing rather than by learning. 
Furthermore, if students are more tempted to game the system 
when presented with challenging problems, even though these are 
exactly the sort of problems needed to improve learning, then 
further research should explore whether or not these findings 
reflect two distinct groups of students. It may be that 
some students need additional cognitive scaffolding or a 
motivational intervention in order to complete these problems 
without gaming, allowing them to learn as well as other students 
who are working through the curriculum in a more appropriate 
way. However, research has also shown that in some cases high 
achieving students also game the system, and the independent 
application of these models could be picking up on that trend, 
where students guess something that they actually know, but then 
correct this behavior in subsequent problems, which could cause 
the MBML model to perceive learning. 


4.1.2 Confusion & Frustration 

Confusion and frustration show considerable overlap, in line with 
prior theory on the relationship between these constructs ([9], 
[26]). As Table 4 shows, half (10) of the semantic features most 
strongly associated with one are also strongly associated with the 
other, including N6mwu (frequency of occurrence) which is 
negatively associated with both confusion and frustration. This 
corresponds with [44]’s findings that clear demarcations of time 
in mathematics problems can improve student outcomes. 


Table 4. Features most strongly associated with confusion and 
frustration 


CONFUSION                                                     | FRUSTRATION
TAG        Semantic Description                       ρ       | TAG        Semantic Description                       ρ
X2.1       Thought/belief                             -0.149  | X2.1       Thought/belief                             -0.110
Z6         Negative                                   -0.101  | N5+        Quantities                                 -0.070
N3.4       Measurement: Volume                        -0.097  | A11.1+++   Important/Significant                      -0.063
N3.3---    Measurement: Distance                      -0.079  | N3.4       Measurement: Volume                        -0.061
N6mwu      Frequency of occurrence                    -0.079  | A2.2       Cause, Connected                           -0.056
A2.2       Cause, Connected                           -0.077  | N6mwu      Frequency of occurrence                    -0.052
A1.5.1     Using                                      -0.076  | X4.2       Means, method                              -0.051
N5+        Quantities                                 -0.070  | T2++       Time: Beginning and ending                 -0.050
I1.3       Money: price                               -0.068  | A2.1+mwu   Modify, Change                             -0.049
O4.1       General Appearance/Phys'l Properties       -0.066  | <font>     HTML font adjustment                       -0.049
Q1.2mwu    Paper documents & writing                   0.081  | I3.1       Work & Employment: generally                0.089
N1         Numbers                                     0.082  | X2.4mwu    Investigate/examine/test/search             0.092
I3.2       Work & Employment: professionalism          0.083  | <span>     HTML span (grouping of items in or          0.092
Z5         Grammatical bin                             0.084  | N6+        Frequency of occurrence                     0.093
A12-       Easy/Difficult                              0.086  | Z5         Grammatical bin                             0.105
<em>       HTML italics                                0.087  | T1.1.2     Time: General: Present; simultaneous        0.116
I3.1       Work & Employment: generally                0.094  | T3-        Time: Old, new and young; age               0.144
S6+        Obligation and necessity                    0.105  | X8+        Trying                                      0.148
X8+        Trying                                      0.115  | A12-       Easy/Difficult                              0.153
Z5mwu      Grammatical bin                             0.135  | Z5mwu      Grammatical bin                             0.162

Notable semantic features within this pairing include Z5 and 
Z5mwu. Both capture what is known as the grammatical bin, which 
includes prepositions (of, to, after, amid), conjunctions (and, or, 
but), certain adverbs (e.g., as, so, which, than, when), the 
infinitival marker (to + verb), determiners (e.g., a and the) and 
certain auxiliary verbs (e.g., do). Previous research has suggested 
that the highly specific style of scientific language increases the 
use of these parts of speech, especially in the sort of definitional 
contexts that we might find in many learning contexts [3]. [29], 
for example, notes that students sometimes struggle with 
prepositions. In fact, this pattern is sometimes referred to as the 
stylistic barrier hypothesis [31], which suggests that differences 
between the language students use at home and the language used 
in the classroom may interfere with the learning process. 


HTML features that correlate with confusion and frustration 
match findings in the literature. For example, [35] suggest that 
italics are difficult to read, and our findings show that they are 
correlated with higher confusion. Changes in font size, however, 
are associated with lower frustration; it is possible that teachers 
are using changes in font size to clarify visual hierarchy and 
problem meaning. 

Features associated with concreteness (N3.4, N3.3, A2.2, A1.5.1, 
N5+, I1.3, O4.1, T2++) correlate with lowered confusion and 
frustration, matching the literature on the concreteness effect, 
which shows not only that concrete words are processed faster 
than abstract words in many experimentally controlled studies 
[23], but also that the two may operate in separate neurological pathways ([19], 
[5]). These findings are hypothesized to be an artifact of the word- 
to-word mapping system the brain uses to process language, 
where concrete words may have stronger ties to more basic 
concepts. Interestingly, [23] have found evidence for similar 
pathways for emotion words, which are acquired early and 
considered quite basic to the human experience. While several of 
the Wmatrix categories that might correspond with [23]’s account 
of emotion words do not appear in this list (E3, E4, X4.1), X2.1, 
described as thoughts/beliefs, has the strongest negative 
associations with both frustration and confusion. 


Other features which correlate with increased confusion and 
frustration may reflect the sort of meta-instructions teachers use to 
support students working with complex mathematical problems. 
Consider, for example, the tags in the following examples: 


(1) You_Z8mf must_S6+ show_A10+ your_Z8 work_I3.1. 

(2) You_Z8mf have_A9+ three_N1 attempts_X8+ 

(3) Often_N6+ it_Z8 helps_S8+ to_Z5 write_Q1.2[i1.2.1 
down_Q1.2[i1.2.2 your_Z8 work_I3.1. 

(4) Keep_A9+ trying_X8+ 

(5) Do_X8+[i1.3.1 your_X8+[i1.3.2 best_X8+[i1.3.3 

(6) Do_A1.1.1 the_Z5 difficult_A12- problems_A12- first_N4 


Several of these tags (I3.1 work; S6+ 
must; Z5 to, the; X8+ attempts, trying; A12- difficult; N6+ often) 
are correlated with increased confusion or frustration. This finding 
may reflect a preemptive scaffolding practice (e.g., teachers 
provide these additional instructions when students are working 
on problem types that they have struggled with in the past). 
However, it is important to rule out other possibilities. For 
instance, such additional instructions could distract or annoy the 
students. More seriously, it could also have priming effects. 
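
To make the tagging format in examples (1) through (6) concrete, the following minimal sketch parses tagged text into (word, tag) pairs and counts tag frequencies. The token shape it assumes (a word, an underscore, a tag, and an optional multi-word marker such as [i1.2.1) is inferred only from the examples above rather than from the Wmatrix documentation, and the names parse_tagged and tag_counts are hypothetical.

```python
import re
from collections import Counter

# Assumed token shape, inferred from the examples in the text:
#   word_TAG, optionally followed by a multi-word marker such as [i1.2.1
TOKEN = re.compile(
    r"(?P<word>[^\s_]+)_"
    r"(?P<tag>[A-Z]\d+(?:\.\d+)*[+-]*[a-z]*)"
    r"(?:\[i[\d.]+)?"
)


def parse_tagged(line):
    """Return the (word, tag) pairs found in one tagged example."""
    return [(m.group("word"), m.group("tag")) for m in TOKEN.finditer(line)]


def tag_counts(lines):
    """Count how often each tag occurs across a list of tagged examples."""
    counts = Counter()
    for line in lines:
        counts.update(tag for _, tag in parse_tagged(line))
    return counts


# The six examples from the text, with one string per example.
EXAMPLES = [
    "You_Z8mf must_S6+ show_A10+ your_Z8 work_I3.1.",
    "You_Z8mf have_A9+ three_N1 attempts_X8+",
    "Often_N6+ it_Z8 helps_S8+ to_Z5 write_Q1.2[i1.2.1 "
    "down_Q1.2[i1.2.2 your_Z8 work_I3.1.",
    "Keep_A9+ trying_X8+",
    "Do_X8+[i1.3.1 your_X8+[i1.3.2 best_X8+[i1.3.3",
    "Do_A1.1.1 the_Z5 difficult_A12- problems_A12- first_N4",
]

counts = tag_counts(EXAMPLES)
```

Per-problem counts of this kind are one plausible input to the correlation analyses discussed in this section; the sketch does not claim to reproduce the feature extraction actually used in the study.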


4.1.3 Engaged Concentration & Boredom 

Like confusion and frustration, we see considerable overlap in the 
features correlated with engaged concentration and boredom. 
However, unlike confusion and frustration, these two outcome 
measures are negatively associated with one another. Six of the 
features most negatively associated with concentration (N5-, 
N3.6, Z5, Q2.2, A4.1, and A5.4+) are among those most 
positively associated with boredom. Likewise, four of those most 
positively associated with concentration (A2.1+mwu, A6.1+++, 
T3, and A5.2+) are negatively associated with boredom. 


Table 5. Features most strongly associated with engaged 
concentration and boredom 


ENGAGED CONCENTRATION                       BOREDOM 

TAG       SEMANTIC DESCRIPTION              TAG       SEMANTIC DESCRIPTION                  ρ 

N5-       Quantities                        T1.1.2    Time: General: Present; simultaneous  -0.051 
N3.6      Measurement: Area                 A5.2+     True/False                            -0.041 
Z5        Grammatical bin                   X2        Mental actions & processes            -0.041 
Q2.2      Speech Acts                       A2.1+mwu  Modify, Change                        -0.034 
A4.1      Generally kinds/groups/examples   M6mwu     Location & Direction                  -0.034 
<em>      HTML italics                      T3-       Time: Old, new and young; age         -0.034 
A6.3+     Variety                           A8        Seem/Appear                           -0.030 
A5.4+     Authenticity                      T2++mwu   Time: Beginning and ending            -0.028 
<p>       HTML paragraph                    A11.1+++  Important/Significant                 -0.027 
Z7        If                                A6.1+++   Similar/Different                     -0.027 
A4.2+     Particular/general; details       A5.4+     Authenticity                           0.059 
N3.5      Measurement: Weight               Z8c       Pronouns etc.                          0.061 
N3.1      Measurement: General              A6.3+     Variety                                0.063 
S5+c      Groups and affiliation            N1        Numbers                                0.076 
A2.1+mwu  Modify, Change                    S6+       Obligation and necessity               0.076 
A6.1+++   Similar/Different                 N5-       Quantities                             0.078 
T3        Time: Old, new and young; age     Q2.2      Speech Acts                            0.083 
A2.1+     Modify, Change                    A4.1      Generally kinds/groups/examples        0.085 
Y1        Science/technology general        Z5        Grammatical bin                        0.086 
A5.2+     True/False                        N3.6      Measurement: Area                      0.093 


Interestingly, X2.1 (thoughts/beliefs) is not as closely related to 
boredom and engagement as it was to confusion and frustration, 
but two other features typically associated with language about 
humans show desirable associations with these two outcome 
measures. For instance, S5+c (groups & affiliation) is associated 
with increased engaged concentration, while X2 (mental 
actions/processes) is associated with lowered boredom. Likewise, 
A8, which tags words related to seem or appear (both mental 
processes typically ascribed to human subjects), is also associated 
with lowered boredom. 


These semantic features, along with several others that correlate 
with lowered boredom (T2++mwu time demarcations and 
M6mwu location/direction), may also be indicators that problems 
with greater narrativity improve student engagement. However, 
we must still be cautious about interpreting lower boredom as a 
desirable effect in and of itself, since A5.2+ (words associated 
with true) is also associated with lower boredom. This type of 
item is unlikely to bore students, since they can answer and pass it 
quickly. However, readers may recall that this feature is also 
correlated with lower learning, as one might expect based on 
previous research on True/False questions [42]. 
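
As a reminder of what a coefficient such as the ρ values in Table 5 summarizes, the sketch below computes a rank correlation between a per-problem feature count and a per-problem outcome estimate. It assumes Spearman's rank correlation, which the ρ notation suggests but this section does not confirm, and the data and function names are hypothetical illustrations rather than values or code from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: how often one tag occurs in each of five problems,
# and a hypothetical average boredom estimate for the same problems.
tag_frequency = [0, 1, 1, 3, 5]
boredom = [0.30, 0.28, 0.25, 0.20, 0.10]
rho = spearman_rho(tag_frequency, boredom)  # negative for this toy data
```

Because the coefficient depends only on ranks, it describes a monotonic association and says nothing about causation, which is consistent with the caution above about reading lower boredom as desirable in itself.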


5. DISCUSSION AND CONCLUSIONS 


Our analyses of the ASSISTments corpus complement previous 
research on the relationship between learning and the language of 
mathematics problems, but extend this line of inquiry by 
including educationally relevant behaviors and affective states as 
part of the learning outcomes measured. As discussed, a number 
of linguistic features (e.g., pronouns, mental states, time, and 
concreteness) have been found to be significant in previous work. 
However, we were also able to examine the degree to which these 
relationships reflect expectations about how behavior, affect, and 
learning are related. 


For instance, some of the same features which were correlated 
with learning were also correlated with student frustration and 
gaming the system. While it might be hypothesized that frustrated 
students would be more likely to game the system, there is also 
evidence from within ASSISTments that frustration can be 
important for learning [26]. The MBML model used here is a 
look-ahead algorithm, which may optimize the opportunity to 
identify the problems that trigger learning even when the learning 
process is causing student frustration. However, it’s also possible 
that these problems are triggering strong but distinct reactions in 
different students (e.g., students who persist vs. students who 
game the system when they become frustrated). Future work will 
hopefully shed more light on this unusual relationship. 


Overall, these results point to a number of promising avenues for 
further research within the ASSISTments system. One key future 
approach will be to conduct RCTs of the features identified in this 
study, re-designing problems to eliminate problematic features or 
incorporate positive features, in order to determine whether our 
findings can drive enhanced design. At the same time, it will be 
important to explore some of the interactions that may exist 
between different combinations of linguistic features, or between 
linguistic features and other behaviors or actions within the tutor. 


We also found several unusual patterns in our data, such as some 
features being associated with increases in both learning and 
gaming the system. We believe this may be due to our dataset 
containing two different populations of students — those who are 
persistent in the face of challenging and difficult problems and 
those who are frustrated by these problems and attempt to game 
the system to avoid working through them. We hope to 
understand this relationship in greater detail through RCTs (as 
discussed below). Ultimately, we hope to use our findings to 
construct guidelines for teachers creating their own content in the 
system, which can be embedded directly into the authoring tools 
teachers use, providing useful feedback on their problem design. 


5.1 Randomized Controlled Trials 


Having found a set of features that are associated with differences 
in student engagement and learning, our next step will be to 
conduct a set of randomized controlled trials (RCTs) to test 
whether the effects we found are genuinely causal, and whether 
re-designing problems based on these findings can improve 
student outcomes. By determining which of these features are 
causal, we can expand scientific understanding of learning and 
engagement in online learning systems. By developing methods 
for concretely improving math problems, we can develop better 
guidelines and recommendations for the many instructors (and 
others) developing problems for the ASSISTments platform. In 
the longer term, we hope to make all of the problems in the 
ASSISTments platform engaging and educationally effective for 
each of the growing number of students who use ASSISTments to 
learn mathematics and other subjects. 


5.2 Continued Feature Engineering 

Another important area of future work will be to conduct further 
feature engineering, particularly in terms of text features specific 
to the language of mathematics. One of the shortcomings of the 
current study is that the language of mathematics is poorly 
modeled in existing tools. In addition to challenges caused by 
domain or context-specific uses of certain words, many semantic 
taggers rely on syntactic probabilities that may be difficult to 
capture when math problems are interspersed with text. Simply 
developing taggers that can identify embedded mathematics 
formulas (e.g., labeling ‘3+2’ as addition) could help to ameliorate 
this issue. We hope that, by developing more robust tools for the 
analysis of this particular corpus, we will be able to better predict 
and understand learning and engagement. 
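
As a rough illustration of the kind of tagger described here, the sketch below finds simple two-operand arithmetic expressions embedded in problem text and labels each with its operation, so that '3+2' is labeled as addition. The label names, the regular expression, and the function name label_expressions are assumptions made for this example; a usable tagger would also need to handle variables, fractions, chained expressions, and hyphenated ranges, which this pattern would mislabel as subtraction.

```python
import re

# Hypothetical label set mapping operator symbols to operation names.
OPERATIONS = {
    "+": "addition",
    "-": "subtraction",
    "*": "multiplication",
    "×": "multiplication",
    "/": "division",
    "÷": "division",
}

# A number, an operator, and a number, with optional spaces in between.
EXPRESSION = re.compile(r"(\d+(?:\.\d+)?)\s*([+\-*×/÷])\s*(\d+(?:\.\d+)?)")


def label_expressions(text):
    """Return (expression, operation) pairs for simple embedded arithmetic."""
    return [(m.group(0), OPERATIONS[m.group(2)]) for m in EXPRESSION.finditer(text)]


labels = label_expressions("What is 3+2? Then simplify 12 / 4 and 7×6.")
```

Labels of this kind could be added alongside the semantic tags as additional features, or used to mask formulas before the text is passed to a general-purpose tagger.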


As research progresses, features derived from combinations of 
Wmatrix tags will also become important since many of the sub- 
categories within and across Wmatrix’s lexical fields may be 
semantically similar enough, or co-occur frequently enough, to 
warrant combining them within ASSISTments data. For example, 
Wmatrix treats deciding as separate from choosing, selecting, and 
picking, but this division may not be useful in mathematics 
learning corpora. Likewise, feature combinations may help to 
contextualize Wmatrix categories that are prone to incorrectly 
categorizing high-frequency words. For example, since many 
features in this study are highly correlated with M1, combinations 
involving this tag may be used to differentiate its use in 
instructions to students (e.g., “You have 3 attempts left”) from its 
use in physical descriptions related to geometry (e.g., “Jill turns 
left and walks 3 more miles.”). 
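
The two ideas in this subsection, merging near-synonymous categories and adding combination features, might be sketched as follows. Everything specific here is an assumption for illustration: the codes X6 and X7 stand in for the deciding and choosing categories, the merged label DECIDE_CHOOSE is invented, the pairing of M1 with the second-person pronoun tag Z8mf is only one possible way to flag instruction-like uses, and the tag sequences are hypothetical rather than actual Wmatrix output.

```python
from collections import Counter

# Hypothetical merge table: tag codes assumed to cover deciding and choosing.
MERGE = {"X6": "DECIDE_CHOOSE", "X7": "DECIDE_CHOOSE"}


def merge_tags(tags, mapping):
    """Replace each tag with its merged label when one is defined."""
    return [mapping.get(tag, tag) for tag in tags]


def cooccurrence_feature(tags, first, second):
    """Return 1 if both tags occur in the same problem, otherwise 0."""
    present = set(tags)
    return int(first in present and second in present)


def build_features(tags):
    """Count merged tags and add one combination feature for a problem."""
    features = Counter(merge_tags(tags, MERGE))
    # Combination feature: M1 together with a second-person pronoun,
    # used here as an assumed signal of instructions addressed to students.
    features["M1&Z8mf"] = cooccurrence_feature(tags, "M1", "Z8mf")
    return features


# Hypothetical tag sequences for the two example sentences in the text.
instruction = build_features(["Z8mf", "A9+", "N1", "X8+", "M1"])
geometry = build_features(["Z1", "M6", "M1", "N1", "N3.3"])
```

In this toy setup the combination feature is 1 for the instruction-like sentence and 0 for the geometry sentence, which is the kind of contrast the combined features are meant to capture.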


5.3 Directions for Future Work 


In this paper, we discovered relationships between semantic 
elements of text in the ASSISTments system and learning, 
affective, and behavioral student outcomes. In doing so, this work 
contributes to the emerging body of research studying the design 
of mathematics problems at scale. 


Our findings show that a large number of semantically meaningful 
relationships exist, some of which correlate with a wide range of 
learner outcomes. These features provide insights that will help to 
develop guidelines for effective problem designs in ITSs. 
However, the existing suite of tools available for large scale 
textual analysis may not be optimal for tagging the specialized 
language of mathematics found in the ASSISTments system. Thus 
an additional area for future work includes the development of 
semantic taggers that are more appropriate for mathematics 
corpora. These efforts will help us to better understand how the 
linguistic properties of math problems influence student success at 
scale. In turn, by exploring potential relationships between 
persistence and student perceptions of challenge, we can work to 
design mathematics problems that are both more informative and 
more engaging. 